Describe the work you have done this week and summarize your learning.
date()
## [1] "Thu Dec 1 18:04:45 2022"
Here we go again…
Note! You might need to install some packages first if you haven't used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot", "psych"))
Load the needed R packages before getting to work:
#load required packages
library(tidyverse)
library(GGally)
#library(dplyr)
library(ggplot2)
library(corrplot)
#library(stringr)
library(psych)
library(FactoMineR)
# load the 'human' data
human <- read.csv("human_row_names.csv", row.names = 1)
# Alternative data url
# human <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", row.names = 1)
# explore the dataset
str(human);dim(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## [1] 155 8
Graphical overview of the data
# summary of the data
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# Summary plot matrix with correlation and density plots
p1 <- ggpairs(human, mapping = aes(alpha=0.3),
title="Descriptives and Pearson correlation",
lower = list(combo = wrap("facethist", bins = 20)))
p1
# compute the correlation matrix and visualize it with corrplot
# correlation matrix
cor_matrix_h <- cor(human, method = 'spearman')
# p-values corresponding to our correlation coefficients
cor_pmatrix_h <- corr.test(human, method = "spearman")$p
# visualize
corrplot(cor_matrix_h, type='upper', tl.col='black', col=COL2('BrBG'), addCoef.col='black', number.cex = 0.6, p.mat = cor_pmatrix_h, sig.level = 0.05)
All our variables are skewed. The significance values (p-values) seem to differ a little between the two plots. In our first plot significance is indicated with asterisks next to the Pearson correlation coefficients.
Our second plot shows all Spearman's correlation coefficients as numbers and as the color and size of the circles; correlations that are not significant at p < 0.05 are crossed out.
This difference arises because ggpairs() (1st plot) uses Pearson's correlation, while we supplied corrplot() (2nd plot) with Spearman's correlation. Since our variables are skewed, I prefer Spearman's correlation for the bivariate correlation analysis.
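To see why Spearman is the safer choice here, a minimal sketch (simulated data, not the course data): Spearman works on ranks, so it is unaffected by any monotone transformation, while Pearson is pulled down by skewness.

```r
# Illustrative only: simulated data, not the 'human' dataset
set.seed(1)
x <- rnorm(100)
y <- exp(x)  # a skewed, monotone transformation of x

cor(x, y)                       # Pearson: clearly below 1
cor(x, y, method = "spearman")  # Spearman: 1, because the ranks are preserved
```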
There seem to be several strong and statistically significant correlations in the data.
I have had a hard time understanding PCA, but our course material had a very clear intro to this (quoted directly): *“Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).
There are two functions in the default package distribution of R that can be used to perform PCA: princomp() and prcomp(). The prcomp() function uses the SVD and is the preferred, more numerically accurate method.
Both methods quite literally decompose a data matrix into a product of smaller matrices, which lets us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.”*
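The SVD connection can be checked directly in base R; a sketch with simulated data (not the course data): prcomp() centers the data and takes its SVD, so the singular values and right singular vectors reproduce the PC standard deviations and loadings.

```r
# Sketch: prcomp() is the SVD of the centered data matrix (simulated data)
set.seed(2)
X <- matrix(rnorm(40), nrow = 10)

p <- prcomp(X)                                    # SVD-based PCA
s <- svd(scale(X, center = TRUE, scale = FALSE))  # SVD of centered X

# PC standard deviations are the singular values divided by sqrt(n - 1)
all.equal(p$sdev, s$d / sqrt(nrow(X) - 1))
# loadings are the right singular vectors (up to sign)
all.equal(abs(p$rotation), abs(s$v), check.attributes = FALSE)
```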
We’ll first run a PCA on raw, non-standardized data (part 2) and then on standardized data (part 3). Finally, we’ll interpret the results in part 4 of this assignment.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# create and print out a summary of pca_human (= variability captured by the principal components)
s <- summary(pca_human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(s$importance[2, ], digits = 1) * 100
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, choices=1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
All the data is clustered in the upper right corner of the plot. PC1 explains 100% of the total variance of the dataset. Because its vector is parallel to PC1, we know that GNI contributes almost exclusively to PC1. This means that knowing an observation's position along PC1 would give an accurate view of how it relates to the other observations in our sample, and essentially only the GNI information would be needed to determine that. Since the data is not scaled, the high influence of GNI on PC1 is most probably due to GNI having the most variance of the variables in our dataset!
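The effect is easy to reproduce in miniature (simulated data standing in for the GNI situation, not the course data): give one variable a much larger variance and unscaled PCA will load PC1 almost entirely on it.

```r
# Illustrative only: one variable dominates unscaled PCA (simulated data)
set.seed(3)
d <- data.frame(small = rnorm(100, sd = 1),
                large = rnorm(100, sd = 1000))  # plays the role of GNI

pr <- prcomp(d)  # no scaling
summary(pr)$importance["Proportion of Variance", ]  # PC1 captures ~100%
```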
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
# create and print out a summary of pca_human (= variability captured by the principal components)
s_hs <- summary(pca_human_std)
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# rounded percentages of variance captured by each PC
pca_pr_hs <- round(s_hs$importance[2, ], digits = 1) * 100
pca_pr_hs
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 50 20 10 10 10 0 0 0
# create object pc_lab to be used as axis labels
pc_lab_hs <- paste0(names(pca_pr_hs), " (", pca_pr_hs, "%)")
# draw a biplot
biplot(pca_human_std, choices=1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_hs[1], ylab = pc_lab_hs[2])
The plot with standardized data looks very different from the earlier biplot made with unstandardized data. Also, the variability captured by the principal components is distributed more realistically between the PCs. The results of the unstandardized and standardized PCA differ because PCA is sensitive to the relative scales of the variables: without scaling, the variable with the largest variance dominates the first component.
For the PCA with standardized data:
In summary, the data should be scaled before PCA for the analysis to be trustworthy.
I’ll elaborate on the results of the PCA with standardized data:
PC1 explains the most (53.6%) of the variability in the dataset; judging from the rotation matrix above, the variables that affect it most are Edu2.FM, Edu.Exp, Life.Exp and GNI in one direction, and Mat.Mor and Ado.Birth in the other.
The four variables loading in one direction are positively correlated with each other, and the two loading in the opposite direction are positively correlated with each other and negatively with the first four.
PC2 explains the second most (16.2%) of the variability in the dataset, and the variables that affect it most are Labo.FM and Parli.F.
These two variables are positively correlated with each other but not with any of the variables driving PC1 (their arrows are roughly perpendicular to the others).
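Instead of reading the dominant variables off the biplot, one can sort the absolute loadings programmatically; a sketch on the built-in USArrests data (a stand-in for 'human', so the variable names differ):

```r
# Sketch on built-in USArrests data (a stand-in for 'human')
p <- prcomp(USArrests, scale. = TRUE)

# variables ordered by the size of their PC1 loading
sort(abs(p$rotation[, "PC1"]), decreasing = TRUE)

# each PC's loading vector has unit length (the rotation is orthonormal)
colSums(p$rotation^2)
```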
According to our assignment, the tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perceptions of the product are (12 questions). In addition, some personal details were asked (4 questions).
We’ll first load the data (wrangled by our course professor K.V.) and then start exploring and analyzing it!
# load the data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# Structure and dimensions of the data
str(tea);dim(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] 300 36
# View data
View(tea)
Let’s visualize the data:
# summary of the data
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
# Summary plot matrix grouped by sex and age quartile
tea_long <- tea %>%
  pivot_longer(cols = -c(sex, age_Q, age), names_to = "variable", values_to = "value") %>%
  group_by(sex, age_Q, variable, value) %>%
  summarise(n = n())
tea_long
## # A tibble: 764 × 5
## # Groups: sex, age_Q, variable [330]
## sex age_Q variable value n
## <fct> <fct> <chr> <fct> <int>
## 1 F +60 always always 2
## 2 F +60 always Not.always 21
## 3 F +60 breakfast breakfast 8
## 4 F +60 breakfast Not.breakfast 15
## 5 F +60 dinner dinner 1
## 6 F +60 dinner Not.dinner 22
## 7 F +60 diuretic diuretic 14
## 8 F +60 diuretic Not.diuretic 9
## 9 F +60 effect.on.health effect on health 6
## 10 F +60 effect.on.health No.effect on health 17
## # … with 754 more rows
# Barplots of all variables
# visualize the dataset
pivot_longer(tea, cols = -c(age)) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Unfortunately, with this many variables the plots are cramped and hard to read.
We'll have to choose some of the most interesting variables to plot:
# column names to keep in the dataset
keep_columns <- c("age_Q", "sex", "Tea", "How", "price", "SPC", "Sport", "frequency", "relaxing", "healthy", "sophisticated")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## age_Q sex Tea How price
## +60 :38 F:178 black : 74 alone:195 p_branded : 95
## 15-24:92 M:122 Earl Grey:193 lemon: 33 p_cheap : 7
## 25-34:69 green : 33 milk : 63 p_private label: 21
## 35-44:40 other: 9 p_unknown : 12
## 45-59:61 p_upscale : 53
## p_variable :112
##
## SPC Sport frequency relaxing
## employee :59 Not.sportsman:121 +2/day :127 No.relaxing:113
## middle :40 sportsman :179 1 to 2/week: 44 relaxing :187
## non-worker :64 1/day : 95
## other worker:20 3 to 6/week: 34
## senior :35
## student :70
## workman :12
## healthy sophisticated
## healthy :210 Not.sophisticated: 85
## Not.healthy: 90 sophisticated :215
##
##
##
##
##
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Again, to quote a very insightful part of our course material: *Multiple Correspondence Analysis (MCA) is a method to analyze qualitative data, and it is an extension of Correspondence Analysis (CA). MCA can be used to detect patterns or structure in the data as well as for dimension reduction.*
Since there are many variables, we'll focus only on the ones we previously chose (“age_Q”, “sex”, “Tea”, “How”, “price”, “SPC”, “Sport”, “frequency”, “relaxing”, “healthy”, “sophisticated”) and run an MCA.
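Under the hood, MCA is essentially correspondence analysis applied to the indicator (complete disjunctive) matrix of the factors: each category becomes a 0/1 column, and every row sums to the number of variables. A minimal base-R sketch with toy data (not the tea set):

```r
# Toy indicator matrix: the representation MCA operates on (not the tea data)
df <- data.frame(Tea   = factor(c("black", "green", "black")),
                 sugar = factor(c("sugar", "No.sugar", "No.sugar")))

# full dummy coding for every level, then drop the intercept column
ind <- model.matrix(~ ., df,
                    contrasts.arg = lapply(df, contrasts, contrasts = FALSE))[, -1]
ind            # one 0/1 column per category
rowSums(ind)   # each row sums to the number of variables (here 2)
```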
# multiple correspondence analysis
mca <- MCA(tea_time, graph = TRUE)
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# summary of the model
mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 11 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "intermediate results"
## 12 "$call$marge.col" "weights of columns"
## 13 "$call$marge.li" "weights of rows"
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = TRUE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.205 0.179 0.138 0.129 0.126 0.125 0.114
## % of var. 8.059 7.023 5.404 5.073 4.946 4.910 4.486
## Cumulative % of var. 8.059 15.082 20.485 25.558 30.504 35.414 39.899
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.108 0.104 0.102 0.099 0.096 0.090 0.089
## % of var. 4.251 4.068 3.994 3.889 3.790 3.521 3.508
## Cumulative % of var. 44.150 48.218 52.212 56.101 59.891 63.412 66.920
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.085 0.082 0.076 0.074 0.071 0.068 0.065
## % of var. 3.353 3.209 3.003 2.896 2.809 2.667 2.536
## Cumulative % of var. 70.273 73.482 76.485 79.382 82.190 84.857 87.393
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.061 0.056 0.055 0.053 0.044 0.028 0.024
## % of var. 2.406 2.208 2.160 2.071 1.720 1.091 0.952
## Cumulative % of var. 89.798 92.007 94.166 96.237 97.957 99.048 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | 0.205 0.068 0.009 | 0.966 1.740 0.207 | 0.345 0.289 0.027
## 2 | 0.298 0.145 0.036 | 0.576 0.618 0.135 | 0.099 0.024 0.004
## 3 | -0.118 0.023 0.006 | 0.135 0.034 0.007 | 0.034 0.003 0.000
## 4 | -0.497 0.401 0.183 | -0.191 0.068 0.027 | -0.017 0.001 0.000
## 5 | -0.237 0.091 0.031 | 0.274 0.140 0.042 | 0.078 0.015 0.003
## 6 | -0.800 1.040 0.253 | 0.002 0.000 0.000 | 0.364 0.322 0.053
## 7 | -0.020 0.001 0.000 | 0.542 0.549 0.107 | 0.441 0.472 0.070
## 8 | 0.197 0.063 0.014 | 0.231 0.099 0.019 | 0.337 0.275 0.041
## 9 | 0.195 0.062 0.014 | 0.569 0.603 0.119 | 0.617 0.922 0.140
## 10 | 0.496 0.400 0.090 | 0.323 0.195 0.038 | 0.553 0.742 0.112
##
## 1 |
## 2 |
## 3 |
## 4 |
## 5 |
## 6 |
## 7 |
## 8 |
## 9 |
## 10 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## +60 | 1.564 13.735 0.355 10.301 | -1.327 11.337 0.255 -8.737 |
## 15-24 | -1.100 16.446 0.535 -12.650 | -0.578 5.203 0.148 -6.643 |
## 25-34 | -0.125 0.159 0.005 -1.182 | 0.668 5.214 0.133 6.310 |
## 35-44 | 0.543 1.741 0.045 3.682 | 0.521 1.842 0.042 3.535 |
## 45-59 | 0.470 1.992 0.056 4.107 | 0.601 3.730 0.092 5.247 |
## F | -0.088 0.203 0.011 -1.836 | -0.402 4.872 0.236 -8.393 |
## M | 0.128 0.297 0.011 1.836 | 0.586 7.108 0.236 8.393 |
## black | 0.708 5.484 0.164 7.008 | -0.092 0.107 0.003 -0.913 |
## Earl Grey | -0.375 4.017 0.254 -8.716 | 0.000 0.000 0.000 -0.010 |
## green | 0.607 1.795 0.046 3.689 | 0.209 0.245 0.005 1.272 |
## Dim.3 ctr cos2 v.test
## +60 0.600 3.011 0.052 3.950 |
## 15-24 0.209 0.882 0.019 2.399 |
## 25-34 -0.358 1.945 0.038 -3.380 |
## 35-44 -0.020 0.003 0.000 -0.134 |
## 45-59 -0.271 0.984 0.019 -2.364 |
## F -0.247 2.393 0.089 -5.159 |
## M 0.360 3.491 0.089 5.159 |
## black 0.673 7.394 0.149 6.664 |
## Earl Grey -0.130 0.719 0.030 -3.020 |
## green -0.750 4.086 0.069 -4.557 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## age_Q | 0.769 0.537 0.103 |
## sex | 0.011 0.236 0.089 |
## Tea | 0.255 0.007 0.185 |
## How | 0.137 0.083 0.061 |
## price | 0.106 0.115 0.351 |
## SPC | 0.697 0.618 0.328 |
## Sport | 0.140 0.061 0.104 |
## frequency | 0.019 0.135 0.104 |
## relaxing | 0.069 0.120 0.034 |
## healthy | 0.033 0.030 0.081 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali") #individuals (ind) invisible = plots variables
plot(mca, invisible=c("var"), graph.type = "classic") #variables (var) invisible = plots individuals
Our first and fourth plots (MCA factor maps) show variable categories with similar profiles grouped together:
‘non-workers’, ‘+60’ year olds and drinking tea in an ‘other’ way (i.e., not alone, not with lemon nor milk) have similar profiles. Also, ‘student’ and ‘15-24’ year olds seem to have similar profiles.
The ‘non-worker’, ‘+60’ and ‘other’ profiles seem to be negatively associated with the ‘student’ and ‘15-24’ categories (opposite sides of the plot origin), and especially with the ‘25-34’ and ‘workman’ categories.
‘non-workers’, ‘+60’ year olds, ‘other’ kinds of drinkers, ‘student’, ‘15-24’ year olds, ‘senior’, and ‘p_unknown’ (tea price unknown) seem to be the most distinctive variable categories in our data (they are the farthest from the origin).
From the third plot (correlation between variables and principal dimensions) we can see that age_Q and SPC are the variables most strongly associated with the first two dimensions (they have the largest eta2 values in the table above).
To assess whether the variable categories differ significantly from each other we’ll draw confidence ellipses, which is possible with ‘FactoMineR’:
plotellipses(mca)
The plots are hard to read, so we'll focus on specific variables, four at a time.
plotellipses(mca, keepvar = c("sex", "Tea", "age_Q", "How"))
plotellipses(mca, keepvar = c("Sport", "price", "frequency", "SPC"))
plotellipses(mca, keepvar = c("relaxing", "healthy", "sophisticated"))
This shows us:
✓ ✓ ✓ ✓ ✓ Ch5 done!